Histopathology Research Template 🔬


1 Introduction

  • State the marker of interest, the study objectives, and hypotheses (Knijn, Simmer, and Nagtegaal 2015).1

2 Materials & Methods

Describe Materials and Methods as highlighted in (Knijn, Simmer, and Nagtegaal 2015).2

  • Describe patient characteristics, and inclusion and exclusion criteria

  • Describe treatment details

  • Describe the type of material used

  • Specify how expression of the biomarker was assessed

  • Describe the number of independent (blinded) scorers and how they scored

  • State the method of case selection, study design, origin of the cases, and time frame

  • Describe the end of the follow-up period and median follow-up time

  • Define all clinical endpoints examined

  • Specify all applied statistical methods

  • Describe how interactions with other clinical/pathological factors were analyzed
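
The median follow-up time named in the list above can be derived from two date columns. The following is a minimal base R sketch, assuming columns named SurgeryDate and LastFollowUpDate as in the example data of this template. The toy dates and the 30.44 days-per-month conversion are illustrative assumptions, and a reverse Kaplan-Meier estimate may be preferable when many patients are censored.

```r
# Toy stand-in for mydata; the dates are invented for illustration
toy <- data.frame(
    SurgeryDate = as.Date(c("2018-01-10", "2018-03-05", "2018-06-20", "2018-09-01")),
    LastFollowUpDate = as.Date(c("2019-01-10", "2019-09-05", "2019-06-20", "2019-03-01"))
)

# Follow-up per patient in days, then converted to months
# (30.44 days per month is an average, used here as an assumption)
followup_days <- as.numeric(difftime(toy$LastFollowUpDate, toy$SurgeryDate, units = "days"))
followup_months <- followup_days / 30.44

median_followup_days <- median(followup_days, na.rm = TRUE)
median_followup_months <- median(followup_months, na.rm = TRUE)
median_followup_months
```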


2.1 Header Codes

Code for general settings.3

Set up global chunk settings.4

knitr::opts_chunk$set(
    eval = TRUE,
    echo = TRUE,
    fig.path = here::here("figs/"),
    message = FALSE,
    warning = FALSE,
    error = FALSE,
    cache = FALSE,
    comment = NA,
    tidy = TRUE,
    fig.width = 6,
    fig.height = 4
)

Load Library

See R/loadLibrary.R for the libraries loaded.

source(file = here::here("R", "loadLibrary.R"))

2.2 Generate Fake Data

Code for generating fake data.5

Generate Fake Data

This code generates fake histopathological data. Sources on fake data generation are listed in footnotes 6 to 14.

Use this code to generate fake clinicopathologic data

source(file = here::here("R", "gc_fake_data.R"))
wakefield::table_heat(x = fakedata, palette = "Set1", flip = TRUE, print = TRUE)
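
R/gc_fake_data.R is not reproduced in this section. As a rough idea of what such a script could do, here is a minimal base R sketch. The column names mirror a few of those in the example data set of this template, while the sampling probabilities, the seed, and the object name `fake_sketch` are assumptions, not the template's actual settings.

```r
set.seed(2020)  # arbitrary seed so the fake data are reproducible
n <- 250

fake_sketch <- data.frame(
    ID = sprintf("%03d", seq_len(n)),
    Sex = sample(c("Male", "Female"), n, replace = TRUE),
    Age = sample(25:73, n, replace = TRUE),
    LVI = sample(c("Absent", "Present"), n, replace = TRUE, prob = c(0.6, 0.4)),
    Grade = sample(c("1", "2", "3"), n, replace = TRUE),
    Death = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3)),
    stringsAsFactors = FALSE
)

# Blank out one Age value to mimic the sporadic missing entries of real records
fake_sketch$Age[sample(n, 1)] <- NA

str(fake_sketch)
```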


2.3 Import Data

Code for importing data.15

Read the data

library(readxl)
mydata <- readxl::read_excel(here::here("data", "mydata.xlsx"))
# View(mydata) # Use to view data after importing

TODO: add code for importing multiple data files and merging them with purrr::reduce().
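
One way to merge several imported tables with purrr::reduce() is sketched here. The three small tables stand in for files read from disk, and the shared ID column and the table names are assumptions made for illustration.

```r
library(dplyr)
library(purrr)

# In practice each table would come from a file, for example (hypothetical paths):
# files  <- list.files(here::here("data"), pattern = "\\.xlsx$", full.names = TRUE)
# tables <- purrr::map(files, readxl::read_excel)

# Toy stand-ins for the imported tables
tables <- list(
    clinical  = tibble::tibble(ID = c("001", "002", "003"), Age = c(54, 60, 30)),
    pathology = tibble::tibble(ID = c("001", "002", "003"), Grade = c("2", "3", "3")),
    followup  = tibble::tibble(ID = c("001", "003"), Death = c(TRUE, FALSE))
)

# Join the tables pairwise from left to right on the shared ID column
combined <- purrr::reduce(tables, function(x, y) dplyr::full_join(x, y, by = "ID"))
combined
```

A full join keeps patients that are missing from one of the tables, so gaps show up as NA rather than as silently dropped rows.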


2.4 Study Population

2.4.1 Report General Features

Code for reporting general features.16

Dataframe Report

# Dataframe report
mydata %>% select(-contains("Date")) %>% report::report(.)
The data contains 250 observations of the following variables:
  - ID: 250 entries: 001, n = 1; 002, n = 1; 003, n = 1 and 247 others
  - Name: 249 entries: Aansh, n = 1; Abdurahmon, n = 1; Abrah, n = 1 and 246 others (1 missing)
  - Sex: 2 entries: Male, n = 126; Female, n = 123 (1 missing)
  - Age: Mean = 50.39, SD = 14.06, range = [25, 73], 1 missing
  - Race: 7 entries: White, n = 162; Hispanic, n = 39; Black, n = 26 and 4 others (1 missing)
  - PreinvasiveComponent: 2 entries: Absent, n = 186; Present, n = 63 (1 missing)
  - LVI: 2 entries: Absent, n = 157; Present, n = 92 (1 missing)
  - PNI: 2 entries: Absent, n = 171; Present, n = 78 (1 missing)
  - Death: 2 levels: FALSE (n = 73); TRUE (n = 176) and missing (n = 1)
  - Group: 2 entries: Control, n = 131; Treatment, n = 118 (1 missing)
  - Grade: 3 entries: 3, n = 101; 1, n = 83; 2, n = 65 (1 missing)
  - TStage: 4 entries: 4, n = 101; 3, n = 66; 2, n = 50 and 1 other (1 missing)
  - Anti-X-intensity: Mean = 2.43, SD = 0.64, range = [1, 3], 1 missing
  - Anti-Y-intensity: Mean = 2.06, SD = 0.78, range = [1, 3], 1 missing
  - LymphNodeMetastasis: 2 entries: Absent, n = 153; Present, n = 96 (1 missing)
  - Valid: 2 levels: FALSE (n = 137); TRUE (n = 112) and missing (n = 1)
  - Smoker: 2 levels: FALSE (n = 125); TRUE (n = 124) and missing (n = 1)
  - Grade_Level: 3 entries: high, n = 100; low, n = 77; moderate, n = 72 (1 missing)
  - DeathTime: 2 entries: Within1Year, n = 149; MoreThan1Year, n = 101
mydata %>% explore::describe_tbl()
250 observations with 21 variables
19 variables containing missings (NA)
0 variables with no variance

2.5 Ethics and IRB

2.5.1 Always Respect Patient Privacy

Always Respect Patient Privacy
- Health Information Privacy17
- Protection of Personal Data (Kişisel Verilerin Korunması)18


2.6 Define Variable Types

Code for defining variable types.19

2.6.1 Find Key Columns

Print column names as a vector

dput(names(mydata))
c("ID", "Name", "Sex", "Age", "Race", "PreinvasiveComponent", 
"LVI", "PNI", "LastFollowUpDate", "Death", "Group", "Grade", 
"TStage", "Anti-X-intensity", "Anti-Y-intensity", "LymphNodeMetastasis", 
"Valid", "Smoker", "Grade_Level", "SurgeryDate", "DeathTime")

2.6.1.1 Find ID and key columns to exclude from analysis

See the code as a function in R/find_key.R.

keycolumns <- mydata %>% sapply(., FUN = dataMaid::isKey) %>% as_tibble() %>% select(which(.[1, 
    ] == TRUE)) %>% names()
keycolumns
[1] "ID"   "Name"
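
R/find_key.R is not shown here. The idea can be approximated without dataMaid: the sketch below uses a hypothetical helper, `looks_like_key()`, that flags non-numeric columns whose values are all distinct. This rule is an assumption made for illustration and may differ from what `dataMaid::isKey` checks.

```r
# Hypothetical helper: a non-numeric column with no repeated values is treated as a key
looks_like_key <- function(x) {
    !is.numeric(x) && anyDuplicated(x) == 0
}

# Toy data frame; values are invented
toy <- data.frame(
    ID = c("001", "002", "003", "004"),
    Name = c("Aansh", "Abrah", NA, "Zed"),
    Sex = c("Male", "Female", "Male", "Female"),
    Age = c(54, 60, 30, 56),
    stringsAsFactors = FALSE
)

key_sketch <- names(toy)[vapply(toy, looks_like_key, logical(1))]
key_sketch
```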

2.6.2 Variable Types

Get variable types

mydata %>% select(-keycolumns) %>% inspectdf::inspect_types()
# A tibble: 4 x 4
  type             cnt  pcnt col_name  
  <chr>          <int> <dbl> <list>    
1 character         11  57.9 <chr [11]>
2 logical            3  15.8 <chr [3]> 
3 numeric            3  15.8 <chr [3]> 
4 POSIXct POSIXt     2  10.5 <chr [2]> 
mydata %>% select(-keycolumns, -contains("Date")) %>% describer::describe() %>% knitr::kable(format = "markdown")
.column_name .column_class .column_type .count_elements .mean_value .sd_value .q0_value .q25_value .q50_value .q75_value .q100_value
Sex character character 250 NA NA Female NA NA NA Male
Age numeric double 250 50.389558 14.0570859 25 38 50 63 73
Race character character 250 NA NA Asian NA NA NA White
PreinvasiveComponent character character 250 NA NA Absent NA NA NA Present
LVI character character 250 NA NA Absent NA NA NA Present
PNI character character 250 NA NA Absent NA NA NA Present
Death logical logical 250 NA NA FALSE NA NA NA TRUE
Group character character 250 NA NA Control NA NA NA Treatment
Grade character character 250 NA NA 1 NA NA NA 3
TStage character character 250 NA NA 1 NA NA NA 4
Anti-X-intensity numeric double 250 2.429719 0.6382312 1 2 3 3 3
Anti-Y-intensity numeric double 250 2.060241 0.7779636 1 1 2 3 3
LymphNodeMetastasis character character 250 NA NA Absent NA NA NA Present
Valid logical logical 250 NA NA FALSE NA NA NA TRUE
Smoker logical logical 250 NA NA FALSE NA NA NA TRUE
Grade_Level character character 250 NA NA high NA NA NA moderate
DeathTime character character 250 NA NA MoreThan1Year NA NA NA Within1Year

Plot variable types

mydata %>% select(-keycolumns) %>% inspectdf::inspect_types() %>% inspectdf::show_plot()

# https://github.com/ropensci/visdat
# http://visdat.njtierney.com/articles/using_visdat.html
# https://cran.r-project.org/web/packages/visdat/index.html
# http://visdat.njtierney.com/

# visdat::vis_guess(mydata)

visdat::vis_dat(mydata)

mydata %>% explore::explore_tbl()

2.6.3 Define Variable Types

2.6.3.1 Find character variables

characterVariables <- mydata %>% select(-keycolumns) %>% inspectdf::inspect_types() %>% 
    dplyr::filter(type == "character") %>% dplyr::select(col_name) %>% pull() %>% 
    unlist()

characterVariables
 [1] "Sex"                  "Race"                 "PreinvasiveComponent"
 [4] "LVI"                  "PNI"                  "Group"               
 [7] "Grade"                "TStage"               "LymphNodeMetastasis" 
[10] "Grade_Level"          "DeathTime"           

2.6.3.2 Find categorical variables

categoricalVariables <- mydata %>% dplyr::select(-keycolumns, -contains("Date")) %>% 
    describer::describe() %>% janitor::clean_names() %>% dplyr::filter(column_type == 
    "factor") %>% dplyr::select(column_name) %>% dplyr::pull()

categoricalVariables
character(0)
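
The empty result is expected: `readxl::read_excel()` returns text columns as character vectors, so no column has type factor yet. A minimal sketch of one way to convert them, using base R on a toy data frame; `char_cols` is a stand-in for `characterVariables`, and whether every character column should become a factor is an assumption to check per study.

```r
toy <- data.frame(
    Sex = c("Male", "Female", "Male"),
    Grade = c("2", "3", "3"),
    Age = c(54, 60, 30),
    stringsAsFactors = FALSE
)

# Names of the character columns (stand-in for characterVariables)
char_cols <- names(toy)[vapply(toy, is.character, logical(1))]

# Convert each character column to a factor, leaving other columns untouched
toy[char_cols] <- lapply(toy[char_cols], factor)

factor_cols <- names(toy)[vapply(toy, is.factor, logical(1))]
factor_cols
```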

2.6.3.3 Find continuous variables

continiousVariables <- mydata %>% dplyr::select(-keycolumns, -contains("Date")) %>% 
    describer::describe() %>% janitor::clean_names() %>% dplyr::filter(column_type == 
    "numeric" | column_type == "double") %>% dplyr::select(column_name) %>% dplyr::pull()

continiousVariables
[1] "Age"              "Anti-X-intensity" "Anti-Y-intensity"

2.6.3.4 Find numeric variables

numericVariables <- mydata %>% select(-keycolumns) %>% inspectdf::inspect_types() %>% 
    dplyr::filter(type == "numeric") %>% dplyr::select(col_name) %>% pull() %>% unlist()

numericVariables
[1] "Age"              "Anti-X-intensity" "Anti-Y-intensity"

2.6.3.5 Find integer variables

integerVariables <- mydata %>% select(-keycolumns) %>% inspectdf::inspect_types() %>% 
    dplyr::filter(type == "integer") %>% dplyr::select(col_name) %>% pull() %>% unlist()

integerVariables
NULL

2.6.3.6 Find list variables

listVariables <- mydata %>% select(-keycolumns) %>% inspectdf::inspect_types() %>% 
    dplyr::filter(type == "list") %>% dplyr::select(col_name) %>% pull() %>% unlist()
listVariables
NULL

2.6.3.7 Find date variables

is_date <- function(x) inherits(x, c("POSIXct", "POSIXt"))

dateVariables <- names(which(sapply(mydata, FUN = is_date) == TRUE))
dateVariables
[1] "LastFollowUpDate" "SurgeryDate"     
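
`is_date()` above matches only date-time columns (POSIXct). If a data source delivers plain `Date` columns, they would be missed. A hedged variant that accepts both, shown on a toy data frame; `is_date_or_datetime` is a hypothetical name.

```r
# Accept plain dates as well as date-times
is_date_or_datetime <- function(x) inherits(x, c("Date", "POSIXct", "POSIXt"))

# Toy data frame; values are invented
toy <- data.frame(
    SurgeryDate = as.Date(c("2018-01-10", "2018-03-05")),
    LastFollowUpDate = as.POSIXct(c("2019-01-10", "2019-09-05"), tz = "UTC"),
    Age = c(54, 60)
)

date_cols <- names(toy)[vapply(toy, is_date_or_datetime, logical(1))]
date_cols
```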

2.7 Overview the Data

Code for getting an overview of the data.20

2.7.1 View Data

View(mydata)
reactable::reactable(data = mydata, sortable = TRUE, resizable = TRUE, filterable = TRUE, 
    searchable = TRUE, pagination = TRUE, paginationType = "numbers", showPageSizeOptions = TRUE, 
    highlight = TRUE, striped = TRUE, outlined = TRUE, compact = TRUE, wrap = FALSE, 
    showSortIcon = TRUE, showSortable = TRUE)

2.7.2 Overview / Exploratory Data Analysis (EDA)

Summary of Data via summarytools 📦

summarytools::view(summarytools::dfSummary(mydata %>% select(-keycolumns)))
if (!dir.exists(here::here("out"))) {
    dir.create(here::here("out"))
}

summarytools::view(x = summarytools::dfSummary(mydata %>% select(-keycolumns)), file = here::here("out", 
    "mydata_summary.html"))

Summary via dataMaid 📦

if (!dir.exists(here::here("out"))) {
    dir.create(here::here("out"))
}

dataMaid::makeDataReport(data = mydata, file = here::here("out", "dataMaid_mydata.Rmd"), 
    replace = TRUE, openResult = FALSE, render = FALSE, quiet = TRUE)

Summary via explore 📦

if (!dir.exists(here::here("out"))) {
    dir.create(here::here("out"))
}

mydata %>% select(-dateVariables) %>% explore::report(output_file = "mydata_report.html", 
    output_dir = here::here("out"))

Glimpse of Data

glimpse(mydata %>% select(-keycolumns, -dateVariables))
Observations: 250
Variables: 17
$ Sex                  <chr> "Male", "Male", "Female", "Male", "Female", "Mal…
$ Age                  <dbl> 54, 60, 30, 56, 45, 69, 61, 60, 39, 65, 63, 72, …
$ Race                 <chr> "Hispanic", "White", "White", "White", "White", …
$ PreinvasiveComponent <chr> "Absent", "Present", "Absent", "Present", "Prese…
$ LVI                  <chr> "Present", "Present", "Absent", "Present", "Abse…
$ PNI                  <chr> "Absent", "Absent", "Present", "Absent", "Presen…
$ Death                <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE…
$ Group                <chr> "Control", "Control", "Control", "Control", "Con…
$ Grade                <chr> "2", "3", "3", "3", "2", "1", "1", "3", "1", "1"…
$ TStage               <chr> "3", "3", "2", "1", "4", "4", "4", "3", "2", "1"…
$ `Anti-X-intensity`   <dbl> 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 1, 2, 3, 1, …
$ `Anti-Y-intensity`   <dbl> 2, 2, 2, 2, 1, 2, 3, 1, 3, 2, 2, 3, 3, 1, 3, 3, …
$ LymphNodeMetastasis  <chr> "Absent", "Absent", "Absent", "Absent", "Absent"…
$ Valid                <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ Smoker               <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, T…
$ Grade_Level          <chr> "moderate", "high", "moderate", "moderate", "low…
$ DeathTime            <chr> "Within1Year", "Within1Year", "Within1Year", "Wi…
mydata %>% explore::describe()
               variable type na na_pct unique min  mean max
1                    ID  chr  0    0.0    250  NA    NA  NA
2                  Name  chr  1    0.4    250  NA    NA  NA
3                   Sex  chr  1    0.4      3  NA    NA  NA
4                   Age  dbl  1    0.4     50  25 50.39  73
5                  Race  chr  1    0.4      8  NA    NA  NA
6  PreinvasiveComponent  chr  1    0.4      3  NA    NA  NA
7                   LVI  chr  1    0.4      3  NA    NA  NA
8                   PNI  chr  1    0.4      3  NA    NA  NA
9      LastFollowUpDate  dat  1    0.4     13  NA    NA  NA
10                Death  lgl  1    0.4      3   0  0.71   1
11                Group  chr  1    0.4      3  NA    NA  NA
12                Grade  chr  1    0.4      4  NA    NA  NA
13               TStage  chr  1    0.4      5  NA    NA  NA
14     Anti-X-intensity  dbl  1    0.4      4   1  2.43   3
15     Anti-Y-intensity  dbl  1    0.4      4   1  2.06   3
16  LymphNodeMetastasis  chr  1    0.4      3  NA    NA  NA
17                Valid  lgl  1    0.4      3   0  0.45   1
18               Smoker  lgl  1    0.4      3   0  0.50   1
19          Grade_Level  chr  1    0.4      4  NA    NA  NA
20          SurgeryDate  dat  1    0.4    221  NA    NA  NA
21            DeathTime  chr  0    0.0      2  NA    NA  NA

Explore

explore::explore(mydata)

2.7.3 Control Data

Check whether the data match expectations

visdat::vis_expect(data = mydata, expectation = ~.x == -1, show_perc = TRUE)

visdat::vis_expect(mydata, ~.x >= 25)

See missing values

visdat::vis_miss(mydata, cluster = TRUE)

visdat::vis_miss(mydata, sort_miss = TRUE)
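
The missingness plots can be complemented by a numeric count of missing values per column. A minimal base R sketch on a toy data frame (values invented for illustration):

```r
toy <- data.frame(
    Sex = c("Male", NA, "Female", "Male"),
    Age = c(54, 60, NA, NA),
    Grade = c("2", "3", "3", "1"),
    stringsAsFactors = FALSE
)

# Number and percentage of missing values in each column
na_count <- colSums(is.na(toy))
na_percent <- round(100 * na_count / nrow(toy), 1)

na_summary <- data.frame(
    variable = names(na_count),
    na = na_count,
    na_pct = na_percent,
    row.names = NULL
)
na_summary
```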

xray::anomalies(mydata)
$variables
               Variable   q qNA  pNA qZero pZero qBlank pBlank qInf pInf
1                 Valid 250   1 0.4%   137 54.8%      0      -    0    -
2                Smoker 250   1 0.4%   125   50%      0      -    0    -
3                 Death 250   1 0.4%    73 29.2%      0      -    0    -
4                   Sex 250   1 0.4%     0     -      0      -    0    -
5  PreinvasiveComponent 250   1 0.4%     0     -      0      -    0    -
6                   LVI 250   1 0.4%     0     -      0      -    0    -
7                   PNI 250   1 0.4%     0     -      0      -    0    -
8                 Group 250   1 0.4%     0     -      0      -    0    -
9   LymphNodeMetastasis 250   1 0.4%     0     -      0      -    0    -
10                Grade 250   1 0.4%     0     -      0      -    0    -
11     Anti-X-intensity 250   1 0.4%     0     -      0      -    0    -
12     Anti-Y-intensity 250   1 0.4%     0     -      0      -    0    -
13          Grade_Level 250   1 0.4%     0     -      0      -    0    -
14               TStage 250   1 0.4%     0     -      0      -    0    -
15                 Race 250   1 0.4%     0     -      0      -    0    -
16     LastFollowUpDate 250   1 0.4%     0     -      0      -    0    -
17                  Age 250   1 0.4%     0     -      0      -    0    -
18          SurgeryDate 250   1 0.4%     0     -      0      -    0    -
19                 Name 250   1 0.4%     0     -      0      -    0    -
20            DeathTime 250   0    -     0     -      0      -    0    -
21                   ID 250   0    -     0     -      0      -    0    -
   qDistinct      type anomalous_percent
1          3   Logical             55.2%
2          3   Logical             50.4%
3          3   Logical             29.6%
4          3 Character              0.4%
5          3 Character              0.4%
6          3 Character              0.4%
7          3 Character              0.4%
8          3 Character              0.4%
9          3 Character              0.4%
10         4 Character              0.4%
11         4   Numeric              0.4%
12         4   Numeric              0.4%
13         4 Character              0.4%
14         5 Character              0.4%
15         8 Character              0.4%
16        13 Timestamp              0.4%
17        50   Numeric              0.4%
18       221 Timestamp              0.4%
19       250 Character              0.4%
20         2 Character                 -
21       250 Character                 -

$problem_variables
 [1] Variable          q                 qNA               pNA              
 [5] qZero             pZero             qBlank            pBlank           
 [9] qInf              pInf              qDistinct         type             
[13] anomalous_percent problems         
<0 rows> (or 0-length row.names)
xray::distributions(mydata)
================================================================================

[1] "Ignoring variable LastFollowUpDate: Unsupported type for visualization."

[1] "Ignoring variable SurgeryDate: Unsupported type for visualization."

          Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99
1 Anti-X-intensity   1    2    2    3    3    3    3
2 Anti-Y-intensity   1    1    1    2    3    3    3
3              Age  26   31   38   50   63   70   73

2.7.4 Explore Data

Summary of Data via DataExplorer 📦

DataExplorer::plot_str(mydata)
DataExplorer::plot_str(mydata, type = "r")
DataExplorer::introduce(mydata)
# A tibble: 1 x 9
   rows columns discrete_columns continuous_colu… all_missing_col…
  <int>   <int>            <int>            <int>            <int>
1   250      21               18                3                0
# … with 4 more variables: total_missing_values <int>, complete_rows <int>,
#   total_observations <int>, memory_usage <dbl>
DataExplorer::plot_intro(mydata)

DataExplorer::plot_missing(mydata)

Drop columns

mydata2 <- DataExplorer::drop_columns(mydata, "TStage")
DataExplorer::plot_bar(mydata)

DataExplorer::plot_bar(mydata, with = "Death")

DataExplorer::plot_histogram(mydata)



3 Statistical Analysis

Learn these tests as highlighted in (Schmidt et al. 2017).21


4 Results

Write results as described in (Knijn, Simmer, and Nagtegaal 2015).22

  • Describe the number of patients included in the analysis and reason for dropout

  • Report patient/disease characteristics (including the biomarker of interest) with the number of missing values

  • Describe the interaction of the biomarker of interest with established prognostic variables

  • Include at least 90% of the initial cases in univariate and multivariate analyses

  • Report the estimated effect (relative risk/odds ratio, confidence interval, and p value) in univariate analysis

  • Report the estimated effect (hazard ratio/odds ratio, confidence interval, and p value) in multivariate analysis

  • Report the estimated effects (hazard ratio/odds ratio, confidence interval, and p value) of other prognostic factors included in multivariate analysis
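
The reporting items above ask for an effect estimate with a confidence interval and p value. As an illustration only, the sketch below fits a univariate logistic regression to simulated data and collects the odds ratio, a Wald 95% confidence interval, and the p value. The variable names (`Death`, `LVI`), the simulated effect size, and the choice of a Wald interval are assumptions; time-to-event endpoints would call for a Cox model and hazard ratios instead.

```r
set.seed(1)  # arbitrary seed; the data are simulated, not taken from a study
n <- 200
LVI <- sample(c("Absent", "Present"), n, replace = TRUE)

# Simulate a higher death probability when LVI is present
Death <- rbinom(n, size = 1, prob = ifelse(LVI == "Present", 0.7, 0.5))
sim <- data.frame(Death = Death, LVI = factor(LVI, levels = c("Absent", "Present")))

# Univariate logistic regression with "Absent" as the reference level
fit <- glm(Death ~ LVI, data = sim, family = binomial)

est <- coef(summary(fit))["LVIPresent", ]
ci <- confint.default(fit)["LVIPresent", ]  # Wald interval on the log-odds scale

# Exponentiate to move from log-odds to the odds ratio scale
or_table <- data.frame(
    term = "LVI: Present vs Absent",
    odds_ratio = exp(est[["Estimate"]]),
    ci_lower = exp(ci[[1]]),
    ci_upper = exp(ci[[2]]),
    p_value = est[["Pr(>|z|)"]]
)
or_table
```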


4.1 Descriptive Statistics

Code for descriptive statistics.23

4.1.1 Table One

Report Data properties via report 📦

mydata %>% dplyr::select(-dplyr::contains("Date")) %>% report::report()
The data contains 250 observations of the following variables:
  - ID: 250 entries: 001, n = 1; 002, n = 1; 003, n = 1 and 247 others
  - Name: 249 entries: Aansh, n = 1; Abdurahmon, n = 1; Abrah, n = 1 and 246 others (1 missing)
  - Sex: 2 entries: Male, n = 126; Female, n = 123 (1 missing)
  - Age: Mean = 50.39, SD = 14.06, range = [25, 73], 1 missing
  - Race: 7 entries: White, n = 162; Hispanic, n = 39; Black, n = 26 and 4 others (1 missing)
  - PreinvasiveComponent: 2 entries: Absent, n = 186; Present, n = 63 (1 missing)
  - LVI: 2 entries: Absent, n = 157; Present, n = 92 (1 missing)
  - PNI: 2 entries: Absent, n = 171; Present, n = 78 (1 missing)
  - Death: 2 levels: FALSE (n = 73); TRUE (n = 176) and missing (n = 1)
  - Group: 2 entries: Control, n = 131; Treatment, n = 118 (1 missing)
  - Grade: 3 entries: 3, n = 101; 1, n = 83; 2, n = 65 (1 missing)
  - TStage: 4 entries: 4, n = 101; 3, n = 66; 2, n = 50 and 1 other (1 missing)
  - Anti-X-intensity: Mean = 2.43, SD = 0.64, range = [1, 3], 1 missing
  - Anti-Y-intensity: Mean = 2.06, SD = 0.78, range = [1, 3], 1 missing
  - LymphNodeMetastasis: 2 entries: Absent, n = 153; Present, n = 96 (1 missing)
  - Valid: 2 levels: FALSE (n = 137); TRUE (n = 112) and missing (n = 1)
  - Smoker: 2 levels: FALSE (n = 125); TRUE (n = 124) and missing (n = 1)
  - Grade_Level: 3 entries: high, n = 100; low, n = 77; moderate, n = 72 (1 missing)
  - DeathTime: 2 entries: Within1Year, n = 149; MoreThan1Year, n = 101

Table 1 via arsenal 📦

# cat(names(mydata), sep = ' + \n')
library(arsenal)
tab1 <- arsenal::tableby(~Sex + Age + Race + PreinvasiveComponent + LVI + PNI + Death + 
    Group + Grade + TStage + `Anti-X-intensity` + `Anti-Y-intensity` + LymphNodeMetastasis + 
    Valid + Smoker + Grade_Level, data = mydata)
summary(tab1)
Overall (N=250)
Sex
   N-Miss 1
   Female 123 (49.4%)
   Male 126 (50.6%)
Age
   N-Miss 1
   Mean (SD) 50.390 (14.057)
   Range 25.000 - 73.000
Race
   N-Miss 1
   Asian 15 (6.0%)
   Bi-Racial 4 (1.6%)
   Black 26 (10.4%)
   Hispanic 39 (15.7%)
   Native 2 (0.8%)
   Other 1 (0.4%)
   White 162 (65.1%)
PreinvasiveComponent
   N-Miss 1
   Absent 186 (74.7%)
   Present 63 (25.3%)
LVI
   N-Miss 1
   Absent 157 (63.1%)
   Present 92 (36.9%)
PNI
   N-Miss 1
   Absent 171 (68.7%)
   Present 78 (31.3%)
Death
   N-Miss 1
   FALSE 73 (29.3%)
   TRUE 176 (70.7%)
Group
   N-Miss 1
   Control 131 (52.6%)
   Treatment 118 (47.4%)
Grade
   N-Miss 1
   1 83 (33.3%)
   2 65 (26.1%)
   3 101 (40.6%)
TStage
   N-Miss 1
   1 32 (12.9%)
   2 50 (20.1%)
   3 66 (26.5%)
   4 101 (40.6%)
Anti-X-intensity
   N-Miss 1
   Mean (SD) 2.430 (0.638)
   Range 1.000 - 3.000
Anti-Y-intensity
   N-Miss 1
   Mean (SD) 2.060 (0.778)
   Range 1.000 - 3.000
LymphNodeMetastasis
   N-Miss 1
   Absent 153 (61.4%)
   Present 96 (38.6%)
Valid
   N-Miss 1
   FALSE 137 (55.0%)
   TRUE 112 (45.0%)
Smoker
   N-Miss 1
   FALSE 125 (50.2%)
   TRUE 124 (49.8%)
Grade_Level
   N-Miss 1
   high 100 (40.2%)
   low 77 (30.9%)
   moderate 72 (28.9%)

Table 1 via tableone 📦

library(tableone)
mydata %>% select(-keycolumns, -dateVariables) %>% tableone::CreateTableOne(data = .)
                                    
                                     Overall      
  n                                    250        
  Sex = Male (%)                       126 (50.6) 
  Age (mean (SD))                    50.39 (14.06)
  Race (%)                                        
     Asian                              15 ( 6.0) 
     Bi-Racial                           4 ( 1.6) 
     Black                              26 (10.4) 
     Hispanic                           39 (15.7) 
     Native                              2 ( 0.8) 
     Other                               1 ( 0.4) 
     White                             162 (65.1) 
  PreinvasiveComponent = Present (%)    63 (25.3) 
  LVI = Present (%)                     92 (36.9) 
  PNI = Present (%)                     78 (31.3) 
  Death = TRUE (%)                     176 (70.7) 
  Group = Treatment (%)                118 (47.4) 
  Grade (%)                                       
     1                                  83 (33.3) 
     2                                  65 (26.1) 
     3                                 101 (40.6) 
  TStage (%)                                      
     1                                  32 (12.9) 
     2                                  50 (20.1) 
     3                                  66 (26.5) 
     4                                 101 (40.6) 
  Anti-X-intensity (mean (SD))        2.43 (0.64) 
  Anti-Y-intensity (mean (SD))        2.06 (0.78) 
  LymphNodeMetastasis = Present (%)     96 (38.6) 
  Valid = TRUE (%)                     112 (45.0) 
  Smoker = TRUE (%)                    124 (49.8) 
  Grade_Level (%)                                 
     high                              100 (40.2) 
     low                                77 (30.9) 
     moderate                           72 (28.9) 
  DeathTime = Within1Year (%)          149 (59.6) 

Descriptive Statistics of Continuous Variables

mydata %>% select(continiousVariables, numericVariables, integerVariables) %>% summarytools::descr(., 
    style = "rmarkdown")
print(summarytools::descr(mydata), method = "render", table.classes = "st-small")
mydata %>% summarytools::descr(., stats = "common", transpose = TRUE, headings = FALSE)
mydata %>% summarytools::descr(stats = "common") %>% summarytools::tb()
mydata$Sex %>% summarytools::freq(cumul = FALSE, report.nas = FALSE) %>% summarytools::tb()
mydata %>% explore::describe() %>% dplyr::filter(unique < 5)
               variable type na na_pct unique min mean max
1                   Sex  chr  1    0.4      3  NA   NA  NA
2  PreinvasiveComponent  chr  1    0.4      3  NA   NA  NA
3                   LVI  chr  1    0.4      3  NA   NA  NA
4                   PNI  chr  1    0.4      3  NA   NA  NA
5                 Death  lgl  1    0.4      3   0 0.71   1
6                 Group  chr  1    0.4      3  NA   NA  NA
7                 Grade  chr  1    0.4      4  NA   NA  NA
8      Anti-X-intensity  dbl  1    0.4      4   1 2.43   3
9      Anti-Y-intensity  dbl  1    0.4      4   1 2.06   3
10  LymphNodeMetastasis  chr  1    0.4      3  NA   NA  NA
11                Valid  lgl  1    0.4      3   0 0.45   1
12               Smoker  lgl  1    0.4      3   0 0.50   1
13          Grade_Level  chr  1    0.4      4  NA   NA  NA
14            DeathTime  chr  0    0.0      2  NA   NA  NA
mydata %>% explore::describe() %>% dplyr::filter(na > 0)
               variable type na na_pct unique min  mean max
1                  Name  chr  1    0.4    250  NA    NA  NA
2                   Sex  chr  1    0.4      3  NA    NA  NA
3                   Age  dbl  1    0.4     50  25 50.39  73
4                  Race  chr  1    0.4      8  NA    NA  NA
5  PreinvasiveComponent  chr  1    0.4      3  NA    NA  NA
6                   LVI  chr  1    0.4      3  NA    NA  NA
7                   PNI  chr  1    0.4      3  NA    NA  NA
8      LastFollowUpDate  dat  1    0.4     13  NA    NA  NA
9                 Death  lgl  1    0.4      3   0  0.71   1
10                Group  chr  1    0.4      3  NA    NA  NA
11                Grade  chr  1    0.4      4  NA    NA  NA
12               TStage  chr  1    0.4      5  NA    NA  NA
13     Anti-X-intensity  dbl  1    0.4      4   1  2.43   3
14     Anti-Y-intensity  dbl  1    0.4      4   1  2.06   3
15  LymphNodeMetastasis  chr  1    0.4      3  NA    NA  NA
16                Valid  lgl  1    0.4      3   0  0.45   1
17               Smoker  lgl  1    0.4      3   0  0.50   1
18          Grade_Level  chr  1    0.4      4  NA    NA  NA
19          SurgeryDate  dat  1    0.4    221  NA    NA  NA
mydata %>% explore::describe()
               variable type na na_pct unique min  mean max
1                    ID  chr  0    0.0    250  NA    NA  NA
2                  Name  chr  1    0.4    250  NA    NA  NA
3                   Sex  chr  1    0.4      3  NA    NA  NA
4                   Age  dbl  1    0.4     50  25 50.39  73
5                  Race  chr  1    0.4      8  NA    NA  NA
6  PreinvasiveComponent  chr  1    0.4      3  NA    NA  NA
7                   LVI  chr  1    0.4      3  NA    NA  NA
8                   PNI  chr  1    0.4      3  NA    NA  NA
9      LastFollowUpDate  dat  1    0.4     13  NA    NA  NA
10                Death  lgl  1    0.4      3   0  0.71   1
11                Group  chr  1    0.4      3  NA    NA  NA
12                Grade  chr  1    0.4      4  NA    NA  NA
13               TStage  chr  1    0.4      5  NA    NA  NA
14     Anti-X-intensity  dbl  1    0.4      4   1  2.43   3
15     Anti-Y-intensity  dbl  1    0.4      4   1  2.06   3
16  LymphNodeMetastasis  chr  1    0.4      3  NA    NA  NA
17                Valid  lgl  1    0.4      3   0  0.45   1
18               Smoker  lgl  1    0.4      3   0  0.50   1
19          Grade_Level  chr  1    0.4      4  NA    NA  NA
20          SurgeryDate  dat  1    0.4    221  NA    NA  NA
21            DeathTime  chr  0    0.0      2  NA    NA  NA

4.1.2 Categorical Variables

Use R/gc_desc_cat.R to generate gc_desc_cat.Rmd, which contains descriptive statistics for the categorical variables.

source(here::here("R", "gc_desc_cat.R"))

4.1.2.1 Descriptive Statistics Sex

mydata %>% janitor::tabyl(Sex) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
Sex n percent valid_percent
Female 123 49.2% 49.4%
Male 126 50.4% 50.6%
NA 1 0.4% -
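
In the table above, `percent` uses all 250 rows as the denominator, whereas `valid_percent` excludes the one missing value. The arithmetic can be checked directly from the counts shown:

```r
n_female <- 123
n_male <- 126
n_missing <- 1
n_total <- n_female + n_male + n_missing  # 250

counts <- c(Female = n_female, Male = n_male)

# percent: the denominator includes the row with a missing value
percent <- round(100 * counts / n_total, 1)

# valid_percent: the denominator excludes the row with a missing value
valid_percent <- round(100 * counts / (n_total - n_missing), 1)

percent        # Female 49.2, Male 50.4
valid_percent  # Female 49.4, Male 50.6
```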

4.1.2.2 Descriptive Statistics Race

mydata %>% janitor::tabyl(Race) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
Race n percent valid_percent
Asian 15 6.0% 6.0%
Bi-Racial 4 1.6% 1.6%
Black 26 10.4% 10.4%
Hispanic 39 15.6% 15.7%
Native 2 0.8% 0.8%
Other 1 0.4% 0.4%
White 162 64.8% 65.1%
NA 1 0.4% -

4.1.2.3 Descriptive Statistics PreinvasiveComponent

mydata %>% janitor::tabyl(PreinvasiveComponent) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
PreinvasiveComponent n percent valid_percent
Absent 186 74.4% 74.7%
Present 63 25.2% 25.3%
NA 1 0.4% -

4.1.2.4 Descriptive Statistics LVI

mydata %>% janitor::tabyl(LVI) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
LVI n percent valid_percent
Absent 157 62.8% 63.1%
Present 92 36.8% 36.9%
NA 1 0.4% -

4.1.2.5 Descriptive Statistics PNI

mydata %>% janitor::tabyl(PNI) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
PNI n percent valid_percent
Absent 171 68.4% 68.7%
Present 78 31.2% 31.3%
NA 1 0.4% -

4.1.2.6 Descriptive Statistics Group

mydata %>% janitor::tabyl(Group) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
Group n percent valid_percent
Control 131 52.4% 52.6%
Treatment 118 47.2% 47.4%
NA 1 0.4% -

4.1.2.7 Descriptive Statistics Grade

mydata %>% janitor::tabyl(Grade) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
Grade n percent valid_percent
1 83 33.2% 33.3%
2 65 26.0% 26.1%
3 101 40.4% 40.6%
NA 1 0.4% -

4.1.2.8 Descriptive Statistics TStage

mydata %>% janitor::tabyl(TStage) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
TStage n percent valid_percent
1 32 12.8% 12.9%
2 50 20.0% 20.1%
3 66 26.4% 26.5%
4 101 40.4% 40.6%
NA 1 0.4% -

4.1.2.9 Descriptive Statistics LymphNodeMetastasis

mydata %>% janitor::tabyl(LymphNodeMetastasis) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
LymphNodeMetastasis n percent valid_percent
Absent 153 61.2% 61.4%
Present 96 38.4% 38.6%
NA 1 0.4% -

4.1.2.10 Descriptive Statistics Grade_Level

mydata %>% janitor::tabyl(Grade_Level) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
Grade_Level n percent valid_percent
high 100 40.0% 40.2%
low 77 30.8% 30.9%
moderate 72 28.8% 28.9%
NA 1 0.4% -

4.1.2.11 Descriptive Statistics DeathTime

mydata %>% janitor::tabyl(DeathTime) %>% janitor::adorn_pct_formatting(rounding = "half up", 
    digits = 1) %>% knitr::kable()
DeathTime n percent
MoreThan1Year 101 40.4%
Within1Year 149 59.6%
race_stats <- summarytools::freq(mydata$Race)
print(race_stats, report.nas = FALSE, totals = FALSE, display.type = FALSE, Variable.label = "Race Group")
mydata %>% explore::describe(PreinvasiveComponent)
variable = PreinvasiveComponent
type     = character
na       = 1 of 250 (0.4%)
unique   = 3
 Absent  = 186 (74.4%)
 Present = 63 (25.2%)
 NA      = 1 (0.4%)
## Frequency or custom tables for categorical variables
SmartEDA::ExpCTable(mydata, Target = NULL, margin = 1, clim = 10, nlim = 5, round = 2, 
    bin = NULL, per = TRUE)
               Variable         Valid Frequency Percent CumPercent
1                   Sex        Female       123    49.2       49.2
2                   Sex          Male       126    50.4       99.6
3                   Sex            NA         1     0.4      100.0
4                   Sex         TOTAL       250      NA         NA
5                  Race         Asian        15     6.0        6.0
6                  Race     Bi-Racial         4     1.6        7.6
7                  Race         Black        26    10.4       18.0
8                  Race      Hispanic        39    15.6       33.6
9                  Race            NA         1     0.4       34.0
10                 Race        Native         2     0.8       34.8
11                 Race         Other         1     0.4       35.2
12                 Race         White       162    64.8      100.0
13                 Race         TOTAL       250      NA         NA
14 PreinvasiveComponent        Absent       186    74.4       74.4
15 PreinvasiveComponent            NA         1     0.4       74.8
16 PreinvasiveComponent       Present        63    25.2      100.0
17 PreinvasiveComponent         TOTAL       250      NA         NA
18                  LVI        Absent       157    62.8       62.8
19                  LVI            NA         1     0.4       63.2
20                  LVI       Present        92    36.8      100.0
21                  LVI         TOTAL       250      NA         NA
22                  PNI        Absent       171    68.4       68.4
23                  PNI            NA         1     0.4       68.8
24                  PNI       Present        78    31.2      100.0
25                  PNI         TOTAL       250      NA         NA
26                Group       Control       131    52.4       52.4
27                Group            NA         1     0.4       52.8
28                Group     Treatment       118    47.2      100.0
29                Group         TOTAL       250      NA         NA
30                Grade             1        83    33.2       33.2
31                Grade             2        65    26.0       59.2
32                Grade             3       101    40.4       99.6
33                Grade            NA         1     0.4      100.0
34                Grade         TOTAL       250      NA         NA
35               TStage             1        32    12.8       12.8
36               TStage             2        50    20.0       32.8
37               TStage             3        66    26.4       59.2
38               TStage             4       101    40.4       99.6
39               TStage            NA         1     0.4      100.0
40               TStage         TOTAL       250      NA         NA
41  LymphNodeMetastasis        Absent       153    61.2       61.2
42  LymphNodeMetastasis            NA         1     0.4       61.6
43  LymphNodeMetastasis       Present        96    38.4      100.0
44  LymphNodeMetastasis         TOTAL       250      NA         NA
45          Grade_Level          high       100    40.0       40.0
46          Grade_Level           low        77    30.8       70.8
47          Grade_Level      moderate        72    28.8       99.6
48          Grade_Level            NA         1     0.4      100.0
49          Grade_Level         TOTAL       250      NA         NA
50            DeathTime MoreThan1Year       101    40.4       40.4
51            DeathTime   Within1Year       149    59.6      100.0
52            DeathTime         TOTAL       250      NA         NA
53     Anti-X-intensity             1        20     8.0        8.0
54     Anti-X-intensity             2       102    40.8       48.8
55     Anti-X-intensity             3       127    50.8       99.6
56     Anti-X-intensity            NA         1     0.4      100.0
57     Anti-X-intensity         TOTAL       250      NA         NA
58     Anti-Y-intensity             1        68    27.2       27.2
59     Anti-Y-intensity             2        98    39.2       66.4
60     Anti-Y-intensity             3        83    33.2       99.6
61     Anti-Y-intensity            NA         1     0.4      100.0
62     Anti-Y-intensity         TOTAL       250      NA         NA
inspectdf::inspect_cat(mydata)
# A tibble: 16 x 5
   col_name               cnt common      common_pcnt levels            
   <chr>                <int> <chr>             <dbl> <named list>      
 1 Death                    3 TRUE               70.4 <tibble [3 × 3]>  
 2 DeathTime                2 Within1Year        59.6 <tibble [2 × 3]>  
 3 Grade                    4 3                  40.4 <tibble [4 × 3]>  
 4 Grade_Level              4 high               40   <tibble [4 × 3]>  
 5 Group                    3 Control            52.4 <tibble [3 × 3]>  
 6 ID                     250 001                 0.4 <tibble [250 × 3]>
 7 LVI                      3 Absent             62.8 <tibble [3 × 3]>  
 8 LymphNodeMetastasis      3 Absent             61.2 <tibble [3 × 3]>  
 9 Name                   250 Aansh               0.4 <tibble [250 × 3]>
10 PNI                      3 Absent             68.4 <tibble [3 × 3]>  
11 PreinvasiveComponent     3 Absent             74.4 <tibble [3 × 3]>  
12 Race                     8 White              64.8 <tibble [8 × 3]>  
13 Sex                      3 Male               50.4 <tibble [3 × 3]>  
14 Smoker                   3 FALSE              50   <tibble [3 × 3]>  
15 TStage                   5 4                  40.4 <tibble [5 × 3]>  
16 Valid                    3 FALSE              54.8 <tibble [3 × 3]>  
inspectdf::inspect_cat(mydata)$levels$Group
# A tibble: 3 x 3
  value      prop   cnt
  <chr>     <dbl> <int>
1 Control   0.524   131
2 Treatment 0.472   118
3 <NA>      0.004     1

4.1.2.12 Split-Group Stats Categorical

library(summarytools)

grouped_freqs <- stby(data = mydata$Smoker, INDICES = mydata$Sex, FUN = freq, cumul = FALSE, 
    report.nas = FALSE)

grouped_freqs %>% tb(order = 2)

4.1.2.13 Grouped Categorical

summarytools::stby(list(x = mydata$LVI, y = mydata$LymphNodeMetastasis), mydata$PNI, 
    summarytools::ctable)
with(mydata, summarytools::stby(list(x = LVI, y = LymphNodeMetastasis), PNI, summarytools::ctable))
SmartEDA::ExpCTable(mydata, Target = "Sex", margin = 1, clim = 10, nlim = NULL, round = 2, 
    bin = 4, per = FALSE)
               VARIABLE      CATEGORY Sex:Female Sex:Male Sex:NA TOTAL
1                  Race         Asian         10        5      0    15
2                  Race     Bi-Racial          2        2      0     4
3                  Race         Black         11       15      0    26
4                  Race      Hispanic         22       17      0    39
5                  Race            NA          0        1      0     1
6                  Race        Native          1        1      0     2
7                  Race         Other          1        0      0     1
8                  Race         White         76       85      1   162
9                  Race         TOTAL        123      126      1   250
10 PreinvasiveComponent        Absent         91       94      1   186
11 PreinvasiveComponent            NA          1        0      0     1
12 PreinvasiveComponent       Present         31       32      0    63
13 PreinvasiveComponent         TOTAL        123      126      1   250
14                  LVI        Absent         84       72      1   157
15                  LVI            NA          0        1      0     1
16                  LVI       Present         39       53      0    92
17                  LVI         TOTAL        123      126      1   250
18                  PNI        Absent         78       93      0   171
19                  PNI            NA          0        0      1     1
20                  PNI       Present         45       33      0    78
21                  PNI         TOTAL        123      126      1   250
22                Group       Control         71       60      0   131
23                Group            NA          1        0      0     1
24                Group     Treatment         51       66      1   118
25                Group         TOTAL        123      126      1   250
26                Grade             1         36       47      0    83
27                Grade             2         37       27      1    65
28                Grade             3         50       51      0   101
29                Grade            NA          0        1      0     1
30                Grade         TOTAL        123      126      1   250
31               TStage             1         14       18      0    32
32               TStage             2         25       25      0    50
33               TStage             3         28       38      0    66
34               TStage             4         56       44      1   101
35               TStage            NA          0        1      0     1
36               TStage         TOTAL        123      126      1   250
37  LymphNodeMetastasis        Absent         76       76      1   153
38  LymphNodeMetastasis            NA          1        0      0     1
39  LymphNodeMetastasis       Present         46       50      0    96
40  LymphNodeMetastasis         TOTAL        123      126      1   250
41          Grade_Level          high         51       48      1   100
42          Grade_Level           low         35       42      0    77
43          Grade_Level      moderate         36       36      0    72
44          Grade_Level            NA          1        0      0     1
45          Grade_Level         TOTAL        123      126      1   250
46            DeathTime MoreThan1Year         44       57      0   101
47            DeathTime   Within1Year         79       69      1   149
48            DeathTime         TOTAL        123      126      1   250
49     Anti-X-intensity             1          7       13      0    20
50     Anti-X-intensity             2         54       47      1   102
51     Anti-X-intensity             3         61       66      0   127
52     Anti-X-intensity            NA          1        0      0     1
53     Anti-X-intensity         TOTAL        123      126      1   250
54     Anti-Y-intensity             1         36       32      0    68
55     Anti-Y-intensity             2         50       48      0    98
56     Anti-Y-intensity             3         36       46      1    83
57     Anti-Y-intensity            NA          1        0      0     1
58     Anti-Y-intensity         TOTAL        123      126      1   250
mydata %>% select(characterVariables) %>% select(PreinvasiveComponent, PNI, LVI) %>% 
    reactable::reactable(data = ., groupBy = c("PreinvasiveComponent", "PNI"), columns = list(LVI = reactable::colDef(aggregate = "count")))

4.1.3 Continuous Variables

questionr:::icut()
source(here::here("R", "gc_desc_cont.R"))

Descriptive Statistics Age

mydata %>% jmv::descriptives(data = ., vars = "Age", hist = TRUE, dens = TRUE, box = TRUE, 
    violin = TRUE, dot = TRUE, mode = TRUE, sd = TRUE, variance = TRUE, skew = TRUE, 
    kurt = TRUE, quart = TRUE)

 DESCRIPTIVES

 Descriptives                       
 ────────────────────────────────── 
                          Age       
 ────────────────────────────────── 
   N                          249   
   Missing                      1   
   Mean                      50.4   
   Median                    50.0   
   Mode                      72.0   
   Standard deviation        14.1   
   Variance                   198   
   Minimum                   25.0   
   Maximum                   73.0   
   Skewness               -0.0364   
   Std. error skewness      0.154   
   Kurtosis                 -1.17   
   Std. error kurtosis      0.307   
   25th percentile           38.0   
   50th percentile           50.0   
   75th percentile           63.0   
 ────────────────────────────────── 

Descriptive Statistics Anti-X-intensity

mydata %>% jmv::descriptives(data = ., vars = "Anti-X-intensity", hist = TRUE, dens = TRUE, 
    box = TRUE, violin = TRUE, dot = TRUE, mode = TRUE, sd = TRUE, variance = TRUE, 
    skew = TRUE, kurt = TRUE, quart = TRUE)

 DESCRIPTIVES

 Descriptives                                
 ─────────────────────────────────────────── 
                          Anti-X-intensity   
 ─────────────────────────────────────────── 
   N                                   249   
   Missing                               1   
   Mean                               2.43   
   Median                             3.00   
   Mode                               3.00   
   Standard deviation                0.638   
   Variance                          0.407   
   Minimum                            1.00   
   Maximum                            3.00   
   Skewness                         -0.672   
   Std. error skewness               0.154   
   Kurtosis                         -0.535   
   Std. error kurtosis               0.307   
   25th percentile                    2.00   
   50th percentile                    3.00   
   75th percentile                    3.00   
 ─────────────────────────────────────────── 

Descriptive Statistics Anti-Y-intensity

mydata %>% jmv::descriptives(data = ., vars = "Anti-Y-intensity", hist = TRUE, dens = TRUE, 
    box = TRUE, violin = TRUE, dot = TRUE, mode = TRUE, sd = TRUE, variance = TRUE, 
    skew = TRUE, kurt = TRUE, quart = TRUE)

 DESCRIPTIVES

 Descriptives                                
 ─────────────────────────────────────────── 
                          Anti-Y-intensity   
 ─────────────────────────────────────────── 
   N                                   249   
   Missing                               1   
   Mean                               2.06   
   Median                             2.00   
   Mode                               2.00   
   Standard deviation                0.778   
   Variance                          0.605   
   Minimum                            1.00   
   Maximum                            3.00   
   Skewness                         -0.105   
   Std. error skewness               0.154   
   Kurtosis                          -1.34   
   Std. error kurtosis               0.307   
   25th percentile                    1.00   
   50th percentile                    2.00   
   75th percentile                    3.00   
 ─────────────────────────────────────────── 

tab <- tableone::CreateTableOne(data = mydata)
# ?print.ContTable
tab$ContTable
                              
                               Overall      
  n                            250          
  Age (mean (SD))              50.39 (14.06)
  Anti-X-intensity (mean (SD))  2.43 (0.64) 
  Anti-Y-intensity (mean (SD))  2.06 (0.78) 
print(tab$ContTable, nonnormal = c("Anti-X-intensity"))
                                 
                                  Overall           
  n                               250               
  Age (mean (SD))                 50.39 (14.06)     
  Anti-X-intensity (median [IQR])  3.00 [2.00, 3.00]
  Anti-Y-intensity (mean (SD))     2.06 (0.78)      
mydata %>% explore::describe(Age)
variable = Age
type     = double
na       = 1 of 250 (0.4%)
unique   = 50
min|max  = 25 | 73
q05|q95  = 28 | 72
q25|q75  = 38 | 63
median   = 50
mean     = 50.38956
mydata %>% select(continiousVariables) %>% SmartEDA::ExpNumStat(data = ., by = "A", 
    gp = NULL, Qnt = seq(0, 1, 0.1), MesofShape = 2, Outlier = TRUE, round = 2)
inspectdf::inspect_num(mydata, breaks = 10)
# A tibble: 3 x 10
  col_name        min    q1 median  mean    q3   max     sd pcnt_na hist        
  <chr>         <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <named list>
1 Age              25    38     50 50.4     63    73 14.1       0.4 <tibble [12…
2 Anti-X-inten…     1     2      3  2.43     3     3  0.638     0.4 <tibble [12…
3 Anti-Y-inten…     1     1      2  2.06     3     3  0.778     0.4 <tibble [12…
inspectdf::inspect_num(mydata)$hist$Age
# A tibble: 27 x 2
   value         prop
   <chr>        <dbl>
 1 [-Inf, 24) 0      
 2 [24, 26)   0.00803
 3 [26, 28)   0.0402 
 4 [28, 30)   0.0201 
 5 [30, 32)   0.0442 
 6 [32, 34)   0.0321 
 7 [34, 36)   0.0442 
 8 [36, 38)   0.0402 
 9 [38, 40)   0.0482 
10 [40, 42)   0.0361 
# … with 17 more rows
inspectdf::inspect_num(mydata, breaks = 10) %>% inspectdf::show_plot()

4.1.3.1 Split-Group Stats Continuous

grouped_descr <- summarytools::stby(data = mydata, INDICES = mydata$Sex, FUN = summarytools::descr, 
    stats = "common")
# grouped_descr %>% summarytools::tb(order = 2)
grouped_descr %>% summarytools::tb()

4.1.3.2 Grouped Continuous

summarytools::stby(data = mydata, INDICES = mydata$PreinvasiveComponent, FUN = summarytools::descr, 
    stats = c("mean", "sd", "min", "med", "max"), transpose = TRUE)
# stats and transpose must be passed to stby() (and on to descr()), not to with()
with(mydata, summarytools::stby(Age, PreinvasiveComponent, summarytools::descr, 
    stats = c("mean", "sd", "min", "med", "max"), transpose = TRUE))
mydata %>% group_by(PreinvasiveComponent) %>% summarytools::descr(stats = "fivenum")
## Summary statistics by – category
SmartEDA::ExpNumStat(mydata, by = "GA", gp = "PreinvasiveComponent", Qnt = seq(0, 
    1, 0.1), MesofShape = 2, Outlier = TRUE, round = 2)
  Vname                        Group  TN nNeg nZero nPos NegInf PosInf NA_Value
1   Age     PreinvasiveComponent:All 250    0     0  249      0      0        1
2   Age  PreinvasiveComponent:Absent 186    0     0  185      0      0        1
3   Age PreinvasiveComponent:Present  63    0     0   63      0      0        0
4   Age      PreinvasiveComponent:NA   0    0     0    0      0      0        0
  Per_of_Missing   sum min  max  mean median    SD   CV  IQR Skewness Kurtosis
1           0.40 12547  25   73 50.39     50 14.06 0.28 25.0    -0.04    -1.17
2           0.54  9357  25   73 50.58     50 14.16 0.28 24.0    -0.05    -1.18
3           0.00  3150  25   73 50.00     50 13.89 0.28 22.5    -0.04    -1.16
4            NaN     0 Inf -Inf   NaN     NA    NA   NA   NA      NaN      NaN
  0%  10% 20%  30%  40% 50%  60%  70%  80% 90% 100% LB.25% UB.75% nOutliers
1 25 31.0  36 40.4 46.2  50 54.8 60.0 65.0  70   73   0.50 100.50         0
2 25 31.0  36 42.0 46.6  50 55.0 60.8 65.0  70   73   3.00  99.00         0
3 25 31.4  36 39.2 46.8  50 54.2 60.0 63.8  69   73   4.25  94.25         0
4 NA   NA  NA   NA   NA  NA   NA   NA   NA  NA   NA     NA     NA         0

4.2 Survival Analysis

Codes for Survival Analysis24

  • Survival analysis with strata, clusters, frailties and competing risks in in Finalfit

https://www.datasurg.net/2019/09/12/survival-analysis-with-strata-clusters-frailties-and-competing-risks-in-in-finalfit/

  • Intracranial WHO grade I meningioma: a competing risk analysis of progression and disease-specific survival

https://link.springer.com/article/10.1007/s00701-019-04096-9

Calculate survival time

mydata$int <- lubridate::interval(lubridate::ymd(mydata$SurgeryDate), lubridate::ymd(mydata$LastFollowUpDate))
mydata$OverallTime <- lubridate::time_length(mydata$int, "month")
mydata$OverallTime <- round(mydata$OverallTime, digits = 1)
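The same calculation can be checked on a couple of hand-made dates before trusting it on the full data. This is only a sketch: the dates below are invented for illustration, and the object names (surgery, follow_up, overall_time) are not part of the template.

```r
library(lubridate)

# Hypothetical toy dates; in the template these come from
# mydata$SurgeryDate and mydata$LastFollowUpDate.
surgery   <- ymd(c("2018-01-15", "2018-06-01"))
follow_up <- ymd(c("2019-01-15", "2018-09-01"))

# Interval between surgery and last follow-up, expressed in months
follow_up_interval <- interval(surgery, follow_up)
overall_time <- round(time_length(follow_up_interval, "month"), digits = 1)
overall_time

# A negative time would indicate swapped or mistyped dates
stopifnot(all(overall_time >= 0, na.rm = TRUE))
```

A check for negative or missing follow-up times at this point tends to be cheaper than finding the problem later in a survival curve.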

Recode the death status as a number for survival analysis

## Recoding mydata$Death into mydata$Outcome
mydata$Outcome <- forcats::fct_recode(as.character(mydata$Death), `1` = "TRUE", `0` = "FALSE")
mydata$Outcome <- as.numeric(as.character(mydata$Outcome))

It is always good practice to double-check after recoding.25

table(mydata$Death, mydata$Outcome)
       
          0   1
  FALSE  73   0
  TRUE    0 176
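The visual cross-tabulation above can be complemented by a programmatic check that stops the report if the recoding went wrong. The sketch below uses a small invented vector in place of mydata$Death and assumes the status is stored as a logical (or as the strings "TRUE"/"FALSE"); the names death, outcome, and recode_ok are illustrative only.

```r
# Toy stand-in for mydata$Death, including one missing value
death <- c(TRUE, FALSE, TRUE, NA, FALSE)

# as.logical() accepts both logical values and the strings "TRUE"/"FALSE"
death_lgl <- as.logical(death)
outcome <- as.integer(death_lgl)  # TRUE -> 1, FALSE -> 0, NA stays NA

# Every death must be coded 1, every survivor 0, and missing values preserved
recode_ok <- all(outcome[which(death_lgl)] == 1L) &&
  all(outcome[which(!death_lgl)] == 0L) &&
  identical(is.na(outcome), is.na(death_lgl))
stopifnot(recode_ok)

# Cross-tabulation that also displays missing values
table(death, outcome, useNA = "ifany")
```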

4.2.1 Kaplan-Meier

library(survival)
# data(lung) km <- with(lung, Surv(time, status))
km <- with(mydata, Surv(OverallTime, Outcome))
head(km, 80)
 [1] 10.0   9.7   3.3  11.0  10.0+  3.7   5.4  11.2+ 10.1   7.6+ 10.2+  4.0 
[13] 10.7   9.8   8.3   8.8   4.9   7.3?  6.6   4.7  10.5+  6.6   6.3  10.7 
[25]  5.2+ 10.9   8.5   9.8   7.1   7.8   9.5  11.6+  7.8   5.3+  4.5+  4.7 
[37]  6.7   9.5+  5.4  10.4   6.3   6.3   5.4   5.0+  3.3   3.5  11.8+  9.8 
[49]  5.9   5.9  10.0   5.3   6.0   9.7   8.0+  8.9   3.2   7.1   7.3+  5.3 
[61]  3.8   5.7   5.9+  3.3  11.8   6.5   3.4+ 11.2   8.9  11.1+  6.4+  9.2+
[73]  7.0   8.9   6.2   7.9   9.0+  5.6+  7.6+ 11.0 
plot(km)
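To make clear what the Kaplan-Meier curve represents, the product-limit estimate can be computed by hand and compared with survfit(). This is a sketch on invented toy data, not on mydata; at each event time the survival estimate is multiplied by (1 - deaths / number at risk).

```r
library(survival)

# Toy data (hypothetical): follow-up in months, 1 = death, 0 = censored
time   <- c(2, 3, 3, 5, 8, 8, 12)
status <- c(1, 1, 0, 1, 0, 1, 0)

# Distinct times at which at least one death occurred
event_times <- sort(unique(time[status == 1]))

# Number still at risk just before each event time, and deaths at that time
at_risk <- sapply(event_times, function(t) sum(time >= t))
deaths  <- sapply(event_times, function(t) sum(time == t & status == 1))

# Product-limit (Kaplan-Meier) estimate
km_manual <- cumprod(1 - deaths / at_risk)

# The same quantity from survfit()
fit <- survfit(Surv(time, status) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

all.equal(km_manual, km_survfit)
```

Censored observations never reduce the curve directly; they only shrink the number at risk for later event times.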

Kaplan-Meier Plot Log-Rank Test

# Drawing Survival Curves Using ggplot2
# https://rpkgs.datanovia.com/survminer/reference/ggsurvplot.html
dependentKM <- "Surv(OverallTime, Outcome)"
explanatoryKM <- "LVI"

mydata %>%
  finalfit::surv_plot(.data = .,
                      dependent = dependentKM,
                      explanatory = explanatoryKM,
                      xlab='Time (months)',
                      pval=TRUE,
                      legend = 'none',
                      break.time.by = 12,
                      xlim = c(0,60)
                      # legend.labs = c('a','b')
                      )

# Drawing Survival Curves Using ggplot2
# https://rpkgs.datanovia.com/survminer/reference/ggsurvplot.html

mydata %>%
  finalfit::surv_plot(.data = .,
                      dependent = "Surv(OverallTime, Outcome)",
                      explanatory = "LVI",
                      xlab='Time (months)',
                      pval=TRUE,
                      legend = 'none',
                      break.time.by = 12,
                      xlim = c(0,60)
                      # legend.labs = c('a','b')
                      )

4.2.2 Univariate Cox-Regression

library(finalfit)
library(survival)
explanatoryUni <- "LVI"
dependentUni <- "Surv(OverallTime, Outcome)"

tUni <- mydata %>% finalfit::finalfit(dependentUni, explanatoryUni)

knitr::kable(tUni, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r"))
Dependent: Surv(OverallTime, Outcome)           all          HR (univariable)           HR (multivariable)
LVI                                    Absent   157 (100.0)  NA                         NA
                                       Present  92 (100.0)   2.02 (1.47-2.78, p<0.001)  2.02 (1.47-2.78, p<0.001)
tUni_df <- tibble::as_tibble(tUni, .name_repair = "minimal") %>% janitor::clean_names()

tUni_df_descr <- paste0("When ", tUni_df$dependent_surv_overall_time_outcome[1], 
    " is ", tUni_df$x[2], ", the hazard is ", tUni_df$hr_univariable[2], " times that ", 
    "when ", tUni_df$dependent_surv_overall_time_outcome[1], " is ", tUni_df$x[1], 
    ".")

When LVI is Present, the hazard is 2.02 (1.47-2.78, p<0.001) times that when LVI is Absent.
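For readers who want to see where the hazard ratio comes from, the sketch below fits a univariable Cox model directly with survival::coxph() and extracts the hazard ratio, its 95% confidence interval, and the p-value. It uses the lung dataset that ships with the survival package rather than mydata, so the variable names (time, status, sex) and the resulting numbers are not those of the template.

```r
library(survival)

# survival::lung codes status as 1 = censored, 2 = dead; Surv() accepts this coding
lung_df <- survival::lung
lung_df$sex <- factor(lung_df$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Univariable Cox proportional hazards model
cox_uni <- coxph(Surv(time, status) ~ sex, data = lung_df)

# Hazard ratio = exp(coefficient); the confidence limits are exponentiated likewise
ci <- exp(confint(cox_uni))
hr_table <- data.frame(
  term      = names(coef(cox_uni)),
  hr        = unname(exp(coef(cox_uni))),
  conf_low  = unname(ci[, 1]),
  conf_high = unname(ci[, 2]),
  p_value   = unname(summary(cox_uni)$coefficients[, "Pr(>|z|)"]),
  row.names = NULL
)
hr_table
```

The first factor level ("Male" here) is the reference category, which is why it has no hazard ratio of its own.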

4.2.3 Kaplan-Meier Median Survival

km_fit <- survfit(Surv(OverallTime, Outcome) ~ LVI, data = mydata)
km_fit
Call: survfit(formula = Surv(OverallTime, Outcome) ~ LVI, data = mydata)

   3 observations deleted due to missingness 
              n events median 0.95LCL 0.95UCL
LVI=Absent  157    111   22.6    15.3    29.4
LVI=Present  90     64    9.8     8.7    13.3
plot(km_fit)

# summary(km_fit)
km_fit_median_df <- summary(km_fit)
km_fit_median_df <- as.data.frame(km_fit_median_df$table) %>% janitor::clean_names() %>% 
    tibble::rownames_to_column()
km_fit
broom::tidy(km_fit)
km_fit_median_definition <- km_fit_median_df %>% dplyr::mutate(description = glue::glue("When {rowname}, median survival is {median} [{x0_95lcl} - {x0_95ucl}, 95% CI] months.")) %>% 
    dplyr::select(description) %>% pull()

When LVI=Absent, median survival is 22.6 [15.3 - 29.4, 95% CI] months., When LVI=Present, median survival is 9.8 [8.7 - 13.3, 95% CI] months.

4.2.4 1-3-5-yr survival

summary(km_fit, times = c(12, 36, 60))
Call: survfit(formula = Surv(OverallTime, Outcome) ~ LVI, data = mydata)

3 observations deleted due to missingness 
                LVI=Absent 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   12     80      54    0.623  0.0408        0.548        0.708
   36     21      42    0.241  0.0412        0.172        0.337

                LVI=Present 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   12     17      47   0.3630  0.0597       0.2629        0.501
   36      2      15   0.0427  0.0292       0.0112        0.163
km_fit_summary <- summary(km_fit, times = c(12, 36, 60))

km_fit_df <- as.data.frame(km_fit_summary[c("strata", "time", "n.risk", "n.event", 
    "surv", "std.err", "lower", "upper")])
km_fit_definition <- km_fit_df %>% dplyr::mutate(description = glue::glue("When {strata}, {time} month survival is {scales::percent(surv)} [{scales::percent(lower)}-{scales::percent(upper)}, 95% CI].")) %>% 
    dplyr::select(description) %>% pull()

When LVI=Absent, 12 month survival is 62% [54.8%-71%, 95% CI]., When LVI=Absent, 36 month survival is 24% [17.2%-34%, 95% CI]., When LVI=Present, 12 month survival is 36% [26.3%-50%, 95% CI]., When LVI=Present, 36 month survival is 4% [1.1%-16%, 95% CI].

4.2.5 Pairwise comparison

dependentKM <- "Surv(OverallTime, Outcome)"
explanatoryKM <- "TStage"

mydata %>%
  finalfit::surv_plot(.data = .,
                      dependent = dependentKM,
                      explanatory = explanatoryKM,
                      xlab='Time (months)',
                      pval=TRUE,
                      legend = 'none',
                      break.time.by = 12,
                      xlim = c(0,60)
                      # legend.labs = c('a','b')
                      )
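The plot above reports a single overall log-rank p-value across all groups. One way to obtain pairwise log-rank comparisons with adjustment for multiple testing is survminer::pairwise_survdiff(). The sketch below is an assumption about how this section might be completed; it uses the lung dataset from the survival package, with ECOG performance status standing in for TStage.

```r
library(survival)
library(survminer)

# Keep ECOG groups 0-2 only (group 3 contains a single patient) and drop missing values
lung_df <- survival::lung
lung_df <- lung_df[!is.na(lung_df$ph.ecog) & lung_df$ph.ecog %in% 0:2, ]
lung_df$ecog <- factor(lung_df$ph.ecog, levels = 0:2,
                       labels = c("ECOG0", "ECOG1", "ECOG2"))

# Log-rank test for every pair of groups, Benjamini-Hochberg adjusted
pairwise_res <- survminer::pairwise_survdiff(
  Surv(time, status) ~ ecog,
  data = lung_df,
  p.adjust.method = "BH"
)

# Lower-triangular matrix of adjusted p-values
pairwise_res$p.value
```

The choice of adjustment method ("BH", "bonferroni", "holm", and so on) should be stated in the Methods section.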

4.2.6 Multivariate Analysis Survival
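This section is empty in the template. As a placeholder, the sketch below shows one possible shape of a multivariable Cox model with survival::coxph(), followed by a proportional hazards check with cox.zph(). It uses the lung dataset from the survival package, so the covariates (age, sex, ph.ecog) are illustrative and are not the variables of mydata.

```r
library(survival)

lung_df <- survival::lung
lung_df$sex <- factor(lung_df$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Multivariable Cox model: each hazard ratio is adjusted for the other covariates
cox_multi <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung_df)

# Adjusted hazard ratios with 95% confidence intervals
hr_multi <- cbind(HR = exp(coef(cox_multi)), exp(confint(cox_multi)))
round(hr_multi, 2)

# Test of the proportional hazards assumption (per covariate and global)
ph_check <- cox.zph(cox_multi)
ph_check
```

With the template data, the same model can presumably be requested by passing several explanatory variables to finalfit::finalfit(), as in the univariate section.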



5 Discussion

  • Interpret the results in context of the working hypothesis elaborated in the introduction and other relevant studies; include a discussion of limitations of the study.

  • Discuss potential clinical applications and implications for future research

References

Knijn, N., F. Simmer, and I. D. Nagtegaal. 2015. “Recommendations for Reporting Histopathology Studies: A Proposal.” Virchows Archiv 466 (6): 611–15. https://doi.org/10.1007/s00428-015-1762-3.

Schmidt, Robert L., Deborah J. Chute, Jorie M. Colbert-Getz, Adolfo Firpo-Betancourt, Daniel S. James, Julie K. Karp, Douglas C. Miller, et al. 2017. “Statistical Literacy Among Academic Pathologists: A Survey Study to Gauge Knowledge of Frequently Used Statistical Tests Among Trainees and Faculty.” Archives of Pathology & Laboratory Medicine 141 (2): 279–87. https://doi.org/10.5858/arpa.2016-0200-OA.


  1. From Table 1: Proposed items for reporting histopathology studies. Recommendations for reporting histopathology studies: a proposal Virchows Arch (2015) 466:611–615 DOI 10.1007/s00428-015-1762-3↩︎

  2. From Table 1: Proposed items for reporting histopathology studies. Recommendations for reporting histopathology studies: a proposal Virchows Arch (2015) 466:611–615 DOI 10.1007/s00428-015-1762-3↩︎

  3. See childRmd/_01header.Rmd file for other general settings↩︎

  4. Change echo = FALSE to hide codes after knitting.↩︎

  5. See childRmd/_02fakeData.Rmd file for other codes↩︎

  6. Synthea The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC Med Inform Decis Mak 19, 44 (2019) doi:10.1186/s12911-019-0793-0↩︎

  7. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-0793-0↩︎

  8. Synthetic Patient Generation↩︎

  9. Basic Setup and Running↩︎

  10. intelligent patient data generator (iPDG)↩︎

  11. https://medium.com/free-code-camp/how-our-test-data-generator-makes-fake-data-look-real-ace01c5bde4a↩︎

  12. https://forums.librehealth.io/t/demo-data-generation/203↩︎

  13. https://mihin.org/services/patient-generator/↩︎

  14. Merge with the lung, cancer, and breast datasets↩︎

  15. See childRmd/_03importData.Rmd file for other codes↩︎

  16. See childRmd/_04briefSummary.Rmd file for other codes↩︎

  17. https://www.hhs.gov/hipaa/index.html↩︎

  18. Recording personal data, and unlawfully disclosing or obtaining personal data, are defined as crimes in the Turkish legal system under Articles 135 and 136 of the Turkish Penal Code. The penalty for recording personal data is 1 to 3 years of imprisonment. The aggravated form of the offense applies when it is committed by a public official abusing the authority of their office, or by taking advantage of the convenience afforded by a particular profession or trade; in that case the penalty is 1.5 to 4.5 years of imprisonment.↩︎

  19. See childRmd/_06variableTypes.Rmd file for other codes↩︎

  20. See childRmd/_07overView.Rmd file for other codes↩︎

  21. Statistical Literacy Among Academic Pathologists: A Survey Study to Gauge Knowledge of Frequently Used Statistical Tests Among Trainees and Faculty. Archives of Pathology & Laboratory Medicine: February 2017, Vol. 141, No. 2, pp. 279-287. https://doi.org/10.5858/arpa.2016-0200-OA↩︎

  22. From Table 1: Proposed items for reporting histopathology studies. Recommendations for reporting histopathology studies: a proposal Virchows Arch (2015) 466:611–615 DOI 10.1007/s00428-015-1762-3↩︎

  23. See childRmd/_11descriptives.Rmd file for other codes↩︎

  24. See childRmd/_18survival.Rmd file for other codes, and childRmd/_19shinySurvival.Rmd for shiny application↩︎

  25. JAMA retraction after miscoding – new Finalfit function to check recoding↩︎

  26. See childRmd/_23footer.Rmd file for other codes↩︎

  27. Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. (2016) Software Citation Principles. PeerJ Computer Science 2:e86. DOI: 10.7717/peerj-cs.86 https://www.force11.org/software-citation-principles↩︎

 

A work by Serdar Balci

drserdarbalci@gmail.com